In this notebook I'll be attempting to predict the stage 1 feed pressure. I'll compare two classical time series models with two machine learning models. The classical models are a simple autoregressive univariate model and an ARIMA model with exogenous variables; the machine learning models are ridge regression and XGBoost. The models are trained to make forecasts using the direct approach, in which the number of models trained equals the number of days into the future we want to forecast. Results show that the XGBoost forecast produces the smallest RMSE even with only minimal hyperparameter tuning and feature engineering. Given the problem statement, RMSE is expected to be the most important metric, as it emphasizes large errors, which can be extremely costly for industrial equipment.

Import Modules

Load Data

Immediately I'll remove some of the columns that I won't be using to predict stage 1 feed pressure.

For now drop rows where the target variable is NaN. Independent variables will be interpolated.
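This cleaning step can be sketched with a small hypothetical frame (the column names here are assumptions, not the dataset's actual names): interpolate the independent variables first, then drop rows where the target is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the plant data; column names are assumed.
df = pd.DataFrame({
    "s1_feed_pressure": [1.0, np.nan, 1.2, 1.3],
    "feed_temp": [20.0, np.nan, 22.0, 23.0],
})

# Interpolate the independent variable first...
df["feed_temp"] = df["feed_temp"].interpolate()

# ...then drop rows where the target itself is NaN.
df = df.dropna(subset=["s1_feed_pressure"])
```

Interpolating before dropping keeps feature rows usable even when they sat next to a missing target.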

Data Visualization and EDA

Right skew

Stage 1 feed pressure (S1FP) is most strongly correlated with reject pressure, running time and salt loading since clean/replace. Feed temperature, product flow, and feed flow are all negatively correlated with S1FP.

Will get a closer look at some of these using a scatterplot matrix.

Feed temperature correlates with feed pressure most of the time, but sometimes feed pressure changes a lot even without changes in feed temperature, so something else must be happening. The color scale also shows times where cleaning occurred yet feed pressure remained very high for a few minutes; this could be due to effects related to restarting or flushing the system. In general, though, increased running time led to higher feed pressure.

When running time since clean becomes too high, essentially the membrane gets clogged and pressure increases dramatically. Not sure what causes those two satellite features at high running time since clean where feed pressure remains low.

Next I want to take a look at the timeseries for some of these features that are highly correlated with S1FP.

For ARIMA forecasts, I'll be able to simply use an increasing value for both of these parameters. I could also include forecasts that assume different maximum running times before cleaning, but I won't have time to implement that now.

ARIMA-Type Base-Model Identification

Will run some statistical analyses to determine the best type of classical model to use for this dataset. First I'm going to downsample the dataset to hourly just to make analysis quicker.
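The downsampling step can be sketched with pandas `resample`, using minute-level synthetic data standing in for the raw feed (the sampling rate and column name are assumptions):

```python
import numpy as np
import pandas as pd

# Minute-level index standing in for the raw data (assumed sampling rate).
idx = pd.date_range("2020-04-01", periods=120, freq="min")
minute = pd.DataFrame({"s1_feed_pressure": np.arange(120.0)}, index=idx)

# Downsample to hourly means to make the statistical analysis quicker.
hourly = minute.resample("h").mean()
```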

First step in deciding whether an autoregression model can be used is to see if the target variable exhibits autocorrelation.

Next I'll plot the autocorrelation function plot to see at what lag values S1FP is autocorrelated.
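The quantity behind an ACF plot can be computed directly with pandas; a minimal sketch on a synthetic trended series standing in for hourly S1FP, where values outside ±1.96/√N are significant at roughly the 5% level (the shaded band on the plot):

```python
import numpy as np
import pandas as pd

# Trended synthetic series standing in for the hourly S1FP data.
rng = np.random.default_rng(0)
s = pd.Series(np.linspace(0, 10, 500) + rng.normal(0, 0.1, 500))

# Lag-k autocorrelations and the approximate 5% significance threshold.
conf = 1.96 / np.sqrt(len(s))
acf_vals = [s.autocorr(lag=k) for k in range(1, 25)]
```

For a trended series like this one, every lag comfortably clears the significance band, which is exactly the signature of non-stationarity described above.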

Strong, statistically significant autocorrelation is exhibited up to a lag of about 200 hours. The presence of high autocorrelation at small lags indicates this is a trended time series and thus not stationary. There does not appear to be any seasonality in this dataset. The lack of stationarity means I'll need to difference the timeseries before using any classical methods such as ARIMA.

Next I'll plot the differenced values.

Differencing the timeseries causes it to be slightly overdifferenced.

I'll use the augmented Dickey-Fuller (ADF) test to further determine whether the time series is stationary. The null hypothesis of this test is the presence of a unit root, meaning we don't have stationarity. If we can reject this null hypothesis, then we have stationarity.

As indicated by the ACF plot, the ADF test shows S1FP is non-stationary. I'll do the same test on the once-differenced data.

As expected, differencing the timeseries once causes it to become stationary.

I also want to get a closer look at the partial autocorrelation function to determine the order that is likely to be used in the ARIMA model.

This indicates that when using an AR model, a lag order (p) of 3 or 4 may be most useful.

Split dataset into training and verification

Check for missing data after downsampling to hourly.

For now I will simply drop those times. A more robust approach would be to interpolate them.

There are still some null values in feed_turbitidity. I'll fill those in using the backward-fill method.

Now I'll split the dataset into training and test sets. I'm choosing 70% since I want to capture the drastic increase in S1FP that occurred in April.

Simple Univariate Timeseries Model

First I'll make a simple univariate autoregression model. Below I show a plot of the training data in blue and test data in orange.

First I'll construct the model using my own estimates that I arrived at from analysis in the previous section. Then I'll use auto_arima from the pmdarima package to automatically determine the best parameters through AIC minimization.

Recall from analysis of the ACF and PACF plots that single difference (d=1) and lag order 3 (p=3) seemed appropriate.

The residuals of the trained model show that, for the training data, it fits well and follows the necessary assumptions for autoregression models: namely, that residuals are uncorrelated and have zero mean.

Next I'll make a forecast with 700 timesteps.

Not surprisingly, a univariate ARIMA model doesn't give great results.

Now I'll run auto_arima to make sure I chose the best model.

Again, the fit model obeys rules for autoregressive models. Residual errors are around mean zero and have uniform variance. ACF plot of residual errors shows they're not correlated.

Basically the same as my model.

ARIMA Model with Exogenous Variables

Here I will fit an ARIMAX model with exogenous variables. For the time being, I'll assume we know the future values of the exogenous variables when making the forecast. If I have time, I'll make a simple forecast for each of the exogenous variables, which would then be fed to this model when making the final forecast.

UPDATE: I don't have time. But this would be quite easy to do in the future.

I'll choose relevant exogenous variables. These are also desirable because they're easy to model themselves.

Data-Driven Models

In this final section, I'll compare a ridge regression multivariable model with an XGBoost ensemble learner. For both models, I'll use the direct method for multistep forecasting. This means I'll need to train as many models as there are time steps I want to forecast into the future.

I'm using SKLearn's GridSearchCV to aid in hyperparameter tuning, with a walk-forward cross-validation approach. With more time, I'd expand the parameter grid over which GridSearchCV is run, leading to an overall better model. I'd also do more feature engineering, such as rolling means of lag variables, categories of membrane age, and the effect of pH on membrane degradation. Finally, given more time I'd like to use quantile regression to build prediction intervals for the XGBoost model.
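The direct approach with walk-forward tuning can be sketched as follows, using a Ridge model on synthetic lag features for brevity (the lag count, horizon, and grid are illustrative choices, not the notebook's actual settings); `TimeSeriesSplit` keeps every training fold strictly before its test fold:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(size=400)) + 50.0  # synthetic target series
n_lags, horizon = 3, 5  # illustrative settings

def make_supervised(series, n_lags, h):
    # Features: the last n_lags values; target: the value h steps ahead.
    X, t = [], []
    for i in range(n_lags - 1, len(series) - h):
        X.append(series[i - n_lags + 1:i + 1])
        t.append(series[i + h])
    return np.array(X), np.array(t)

# Direct approach: train one tuned model per forecast horizon.
models = {}
for h in range(1, horizon + 1):
    X, t = make_supervised(y, n_lags, h)
    gs = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=TimeSeriesSplit(n_splits=3))  # walk-forward CV
    models[h] = gs.fit(X, t)

# Multistep forecast: each horizon's model predicts from the same last window.
last = y[-n_lags:].reshape(1, -1)
forecast = [float(models[h].predict(last)[0]) for h in range(1, horizon + 1)]
```

Swapping `Ridge()` for an XGBoost regressor leaves the surrounding direct-forecast machinery unchanged, which is why the two models can share the same training functions.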

Below are the two functions I'll be using to train the models for each time-step and then save the forecast.

With the models trained, I can plot their results below. The top panel shows a timeseries of observation and forecast predictions from each model. The bottom panel shows scatter plots of observations vs predictions.

Metrics and Conclusion

XGBoost is better than the ridge regressor on all metrics calculated. Both models are better than the ARIMA/X models evaluated previously (RMSE = 0.74 and MAE = 0.59 for ARIMAX; RMSE = 0.73 and MAE = 0.60 for ARIMA). MAE looks better for the ARIMA-type models only because it doesn't penalize large errors as heavily as RMSE. Given the problem we're trying to solve, RMSE is the more important metric: small prediction errors aren't going to be very detrimental, but massive errors could cause the model to miss important events, leading to broken equipment and lost time and money.
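The RMSE-vs-MAE distinction can be made concrete with a toy example: two forecasts with the same total absolute error, one spreading it across many small misses and one concentrating it in a single large miss.

```python
import numpy as np

obs = np.array([1.0, 1.0, 1.0, 1.0])
# Same total absolute error in both forecasts...
pred_a = obs + np.array([0.5, 0.5, 0.5, 0.5])  # many small errors
pred_b = obs + np.array([0.0, 0.0, 0.0, 2.0])  # one large error

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def mae(y, p):
    return float(np.mean(np.abs(y - p)))
```

Both forecasts have identical MAE (0.5), but the one with the single large miss has double the RMSE (1.0 vs 0.5), which is exactly why RMSE is the right headline metric when large misses are the costly ones.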